Ensemble Techniques Project

Part A

A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

Importing the modules required for the project

Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable.

Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable.

Merge both DataFrames on the key ‘customerID’ to form a single DataFrame.

Verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python.

We can see that the columns from df1 are merged into df.

We can see that the columns from df2 are merged into df.
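A minimal, self-contained sketch of the merge and column check described above. Tiny stand-in frames are used here so the example runs on its own; in the project, `df1` and `df2` come from `pd.read_csv` on the two CSV files.

```python
import pandas as pd

# Stand-in frames; the project reads these from the two Telco CSV files
df1 = pd.DataFrame({"customerID": ["0001", "0002"], "gender": ["Male", "Female"]})
df2 = pd.DataFrame({"customerID": ["0001", "0002"], "Churn": ["No", "Yes"]})

# Merge on the shared key to form a single DataFrame
df = df1.merge(df2, on="customerID")

# Simple comparison: every column of both inputs must appear in the merge
all_present = set(df1.columns).issubset(df.columns) and set(df2.columns).issubset(df.columns)
print(all_present)  # True
```

The subset comparison is one simple way to satisfy the "comparison operator" check; comparing sorted column lists with `==` works equally well.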

Impute missing/unexpected values in the DataFrame

Null values are dropped from the DataFrame.

Make sure all the variables with continuous values are of ‘Float’ type

MonthlyCharges, TotalCharges, and tenure hold continuous values, so their datatypes are changed to float.
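A sketch of the imputation and type-conversion step, assuming (as is common with this dataset) that `TotalCharges` arrives as strings with blank entries for brand-new customers. A tiny stand-in frame is used so the example runs on its own.

```python
import pandas as pd

# Stand-in for the merged Telco frame; the blank TotalCharges entry
# mimics the unexpected values found in the real data (an assumption)
df = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", " "],  # blank = unexpected value
})

# Coerce to numeric floats; unparsable entries become NaN, then drop them
for col in ["tenure", "MonthlyCharges", "TotalCharges"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)
df = df.dropna()

print(df.dtypes.tolist())  # all float64
print(len(df))             # 2 rows remain after dropping the blank entry
```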

Create a function that will accept a DataFrame as input and return pie-charts for all the appropriate Categorical features. Clearly show percentage distribution in the pie-chart.
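One possible shape for such a function, with a hypothetical name and a cardinality cutoff to skip identifier-like columns. It uses matplotlib's non-interactive backend so the sketch runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical_pies(df, max_levels=6):
    """Draw a pie chart with percentage labels for every low-cardinality
    non-numeric column; returns the list of columns that were plotted."""
    plotted = []
    for col in df.select_dtypes(exclude="number").columns:
        counts = df[col].value_counts()
        if len(counts) > max_levels:  # skip IDs / free-text columns
            continue
        fig, ax = plt.subplots()
        ax.pie(counts, labels=counts.index, autopct="%1.1f%%")
        ax.set_title(col)
        plt.close(fig)
        plotted.append(col)
    return plotted

# Tiny stand-in frame to demonstrate the function
demo = pd.DataFrame({"gender": ["Male", "Female", "Female"],
                     "Churn": ["No", "Yes", "No"],
                     "tenure": [1, 2, 3]})
print(plot_categorical_pies(demo))  # ['gender', 'Churn']
```

`autopct="%1.1f%%"` is what puts the percentage distribution directly on each wedge, as the task requires.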

Share insights for Q2.c

Gender is split almost equally in the data.

About 30% of customers have dependents.

90.3% have phone service.

Of those, 42.2% have multiple lines.

44.4% of customers use fiber-optic internet service, 34.4% use DSL, and 21.6% have no internet service.

Of the customers with internet service, 28.7% have online security.

34.5% have online backup, 34.4% have device protection, and 29% have tech support.

38.4% have streaming TV service.

38.8% have streaming movie service.

Of the three contract types, most customers opt for the month-to-month subscription.

60% of customers choose paperless billing.

Electronic check is the most preferred payment method.

There is some imbalance in the churn classes: 73.4% of customers do not churn.

Encode all the appropriate Categorical features with the best suitable approach

Split the data into 80% train and 20% test.

Normalize/Standardize the data with the best suitable approach.
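The encoding, splitting, and scaling steps above can be sketched as follows. A small synthetic frame stands in for the cleaned merged data; a stratified split preserves the churn proportions, and the scaler is fitted on the training set only to avoid leakage.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in frame; in the project this is the cleaned merged DataFrame
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "Contract": ["Month-to-month", "Two year"] * 10,
    "MonthlyCharges": [float(i) for i in range(20)],
    "Churn": ["No", "Yes"] * 10,
})

# Label-encode the binary target; one-hot encode the other categoricals
y = df["Churn"].map({"No": 0, "Yes": 1})
X = pd.get_dummies(df.drop(columns="Churn"), drop_first=True)

# 80/20 split, stratified to keep the churn proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize the continuous column; fit on train only to avoid leakage
scaler = StandardScaler()
X_train[["MonthlyCharges"]] = scaler.fit_transform(X_train[["MonthlyCharges"]])
X_test[["MonthlyCharges"]] = scaler.transform(X_test[["MonthlyCharges"]])

print(X_train.shape, X_test.shape)
```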

Train a model using XGBoost. Also print best performing parameters along with train and test performance.

Accuracy is 77% on the test data and 94% on the training data, which indicates the model is overfitting.

Contract acts as the root node of the tree; it is more important than any other feature.

We can perform hyperparameter tuning to get optimal results.

Improve performance of the XGBoost as much as possible. Also print best performing parameters along with train and test performance.

Removing columns with low importance.

After optimizing the model, test accuracy increased to 79%.

The most important feature is still Contract.

Training accuracy also dropped, suggesting the model is no longer overfitting.

Part B

  1. Build a simple ML workflow that accepts a single ‘.csv’ file as input and returns a trained base model that can be used for predictions. You can use one dataset from Part A (single/merged).
  2. Create separate functions for various purposes.
  3. Various base models should be trained to select the best performing model.
  4. Pickle file should be saved for the best performing model.

Include best coding practices in the code: modularization, maintainability, well-commented code, etc.
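A minimal sketch of such a workflow, with hypothetical function names (`load_data`, `preprocess`, `select_best_model`, `save_model`) and two illustrative base models; the project's actual model list and preprocessing may differ. The demo at the bottom uses a tiny in-memory frame so the sketch runs without the CSV file.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def load_data(csv_path):
    """Read the input '.csv' file into a DataFrame."""
    return pd.read_csv(csv_path)

def preprocess(df, target):
    """Encode the target and categoricals, then split 80/20."""
    y = df[target].astype("category").cat.codes
    X = pd.get_dummies(df.drop(columns=target), drop_first=True)
    return train_test_split(X, y, test_size=0.2, random_state=42)

def select_best_model(X_train, X_test, y_train, y_test):
    """Train several base models and return the most accurate one."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=42),
    }
    best_name, best_model, best_acc = None, None, -1.0
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        acc = model.score(X_test, y_test)
        if acc > best_acc:
            best_name, best_model, best_acc = name, model, acc
    return best_name, best_model, best_acc

def save_model(model, path="best_model.pkl"):
    """Pickle the winning model so it can be reloaded for predictions."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

# Demo with a tiny stand-in frame; the project calls load_data() instead
demo = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "MonthlyCharges": [float(i) for i in range(20)],
    "Churn": ["No", "Yes"] * 10,
})
X_train, X_test, y_train, y_test = preprocess(demo, "Churn")
best_name, best_model, best_acc = select_best_model(X_train, X_test, y_train, y_test)
save_model(best_model)
print(best_name, round(best_acc, 2))
```

Keeping each stage in its own function makes the workflow easy to test and extend; adding another base model only means adding one entry to the `candidates` dict.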

After calling the main function, we passed in the merged dataset from Part A as input.

After training various base models, we found that logistic regression gives the best accuracy.